Problem Statement: Movie Award Prediction: Given the details of a movie, predict whether it will win an award or not¶

Data Science Flow¶

In [ ]:
1. Business Understanding (Project Name)----> (Domain Knowledge)-->Domain Expert

2. Data Requirement (What data is required to complete the project)

3. Data Collection (From Different sources and with Different Tools and Technologies)

4. Data Preparation (EDA/Data Preparation/Data Cleaning/Data Munging)

5. Machine Learning/Predictive Analysis/Data Modelling/Data Mining (Clean Data + Algorithms = Model)
Generalised Model = a model that works well on unseen data

6. Model Evaluation -Evaluation metrics (Test Model)

7. Model Tuning

8. Model Deployment

9. Monitoring

EDA of Movie Award prediction¶

In [ ]:
EDA/ Data Preparation/Data Cleaning Steps

1. Removing Duplicate data

2. Missing Value Treatment

     6 Methods

3. Outlier Treatment

     5 Methods

4. Categorical to Numerical Conversion

5. Numerical to Categorical Conversion(Binning)

6. Feature Scaling

7. Feature Transformation

8. Feature selection
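Steps 4 and 5 above can be sketched on a hypothetical toy frame (the column names and values are made up; pd.get_dummies and pd.cut are the usual pandas tools for these conversions):

```python
import pandas as pd

# Hypothetical toy data to illustrate steps 4 and 5
toy = pd.DataFrame({
    "Genre": ["Drama", "Comedy", "Drama", "Action"],
    "Movie_length": [95, 142, 178, 110],
})

# 4. Categorical -> Numerical: one-hot encoding
#    (drop_first avoids a redundant column)
encoded = pd.get_dummies(toy, columns=["Genre"], drop_first=True)

# 5. Numerical -> Categorical (Binning): cut Movie_length into labelled ranges
toy["Length_bin"] = pd.cut(toy["Movie_length"],
                           bins=[0, 120, 150, 200],
                           labels=["Short", "Medium", "Long"])
```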
In [ ]:
Example: three columns on very different scales (why feature scaling matters)

1  0 to 1   

2  1000 to 2000

3  100000 to 200000
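Columns on such different scales would dominate any distance-based computation; a minimal min-max scaling sketch with made-up values (sklearn's MinMaxScaler gives the same result):

```python
import pandas as pd

# Three hypothetical columns on very different scales
data = pd.DataFrame({
    "a": [0.0, 0.5, 1.0],
    "b": [1000, 1500, 2000],
    "c": [100000, 150000, 200000],
})

# Min-max scaling maps every column to the same 0-to-1 range:
# (x - min) / (max - min), computed per column
scaled = (data - data.min()) / (data.max() - data.min())
```

After scaling, all three columns span exactly 0 to 1, so no single column overwhelms the others.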
In [ ]:
Visualization: Storytelling 

1. Univariate (1 column) Analysis: Histogram/Boxplot

2. Bivariate (2 columns) Analysis: Line plot/Scatter plot/Bar plot..........

3. Multivariate (3+ columns) Analysis: Heatmap (Correlation Analysis)/Pairplot (multiple scatter plots)
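A minimal sketch of the three analysis levels on synthetic data (the arrays and titles here are made up for illustration; the notebook's own plots use the movie columns):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)        # one synthetic column
y = 2 * x + rng.normal(0, 5, 200)  # a second, correlated column

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(x)                    # univariate: distribution of one column
axes[0].set_title("Univariate: histogram")

axes[1].scatter(x, y)              # bivariate: relationship between two columns
axes[1].set_title("Bivariate: scatter")

corr = np.corrcoef(x, y)           # multivariate: pairwise correlations
axes[2].imshow(corr)
axes[2].set_title("Multivariate: heatmap")

fig.tight_layout()
```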
In [ ]:
When to use what

1. 
In [ ]:
1. Import Libraries

2. Reading Dataset

3. Data Preparation

4. Data Visualization

Clean Data (Input and Output)

5. Split the Data into Training and Testing

6. Apply algorithm on Training Data

7. Predict on Test Data 

8. Model Evaluation
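Steps 5 to 8 can be sketched end-to-end; a minimal example on synthetic data (make_classification stands in for the cleaned movie features, and LogisticRegression is just one possible algorithm):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a cleaned input/output split
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5. Split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 6. Apply an algorithm to the training data
model = LogisticRegression().fit(X_train, y_train)

# 7. Predict on the test data
y_pred = model.predict(X_test)

# 8. Model evaluation with an evaluation metric
acc = accuracy_score(y_test, y_pred)
```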

1. Import Necessary Libraries¶

In [1]:
# Importing all necessary libraries: Data Science packages

import numpy as np               # numpy: mathematical operations on arrays
import pandas as pd              # pandas: data manipulation on DataFrames
import seaborn as sns            # seaborn: statistical data visualization
import matplotlib.pyplot as plt  # matplotlib: plotting and visualization

2. Reading Dataset¶

In [2]:
# Read the data with pandas

df = pd.read_csv("Movie_classification.csv", header=0)
df
Out[2]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views 3D_available Time_taken Twitter_hastags Genre Avg_age_actors Num_multiplex Collection Tech_Oscar
0 20.1264 59.62 0.462 36524.125 138.7 7.825 8.095 7.910 7.995 7.94 527367 YES 109.60 223.840 Thriller 23 494 48000 1
1 20.5462 69.14 0.531 35668.655 152.4 7.505 7.650 7.440 7.470 7.44 494055 NO 146.64 243.456 Drama 42 462 43200 0
2 20.5458 69.14 0.531 39912.675 134.6 7.485 7.570 7.495 7.515 7.44 547051 NO 147.88 2022.400 Comedy 38 458 69400 1
3 20.6474 59.36 0.542 38873.890 119.3 6.895 7.035 6.920 7.020 8.26 516279 YES 185.36 225.344 Drama 45 472 66800 1
4 21.3810 59.36 0.542 39701.585 127.7 6.920 7.070 6.815 7.070 8.26 531448 NO 176.48 225.792 Drama 55 395 72400 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 21.2526 78.86 0.427 36624.115 142.6 8.680 8.775 8.620 8.970 6.80 492480 NO 186.96 243.584 Action 27 561 44800 0
502 20.9054 78.86 0.427 33996.600 150.2 8.780 8.945 8.770 8.930 7.80 482875 YES 132.24 263.296 Action 20 600 41200 0
503 21.2152 78.86 0.427 38751.680 164.5 8.830 8.970 8.855 9.010 7.80 532239 NO 109.56 243.824 Comedy 31 576 47800 0
504 22.1918 78.86 0.427 37740.670 162.8 8.730 8.845 8.800 8.845 6.80 496077 YES 158.80 303.520 Comedy 47 607 44000 0
505 20.9482 78.86 0.427 33496.650 154.3 8.640 8.880 8.680 8.790 6.80 518438 YES 205.60 203.040 Comedy 45 604 38000 0

506 rows × 19 columns

In [5]:
# Reading first 5 Rows of the data

df.head()
Out[5]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views 3D_available Time_taken Twitter_hastags Genre Avg_age_actors Num_multiplex Collection Tech_Oscar
0 20.1264 59.62 0.462 36524.125 138.7 7.825 8.095 7.910 7.995 7.94 527367 YES 109.60 223.840 Thriller 23 494 48000 1
1 20.5462 69.14 0.531 35668.655 152.4 7.505 7.650 7.440 7.470 7.44 494055 NO 146.64 243.456 Drama 42 462 43200 0
2 20.5458 69.14 0.531 39912.675 134.6 7.485 7.570 7.495 7.515 7.44 547051 NO 147.88 2022.400 Comedy 38 458 69400 1
3 20.6474 59.36 0.542 38873.890 119.3 6.895 7.035 6.920 7.020 8.26 516279 YES 185.36 225.344 Drama 45 472 66800 1
4 21.3810 59.36 0.542 39701.585 127.7 6.920 7.070 6.815 7.070 8.26 531448 NO 176.48 225.792 Drama 55 395 72400 1
In [6]:
# Reading last 5 Rows of the data

df.tail()
Out[6]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views 3D_available Time_taken Twitter_hastags Genre Avg_age_actors Num_multiplex Collection Tech_Oscar
501 21.2526 78.86 0.427 36624.115 142.6 8.68 8.775 8.620 8.970 6.8 492480 NO 186.96 243.584 Action 27 561 44800 0
502 20.9054 78.86 0.427 33996.600 150.2 8.78 8.945 8.770 8.930 7.8 482875 YES 132.24 263.296 Action 20 600 41200 0
503 21.2152 78.86 0.427 38751.680 164.5 8.83 8.970 8.855 9.010 7.8 532239 NO 109.56 243.824 Comedy 31 576 47800 0
504 22.1918 78.86 0.427 37740.670 162.8 8.73 8.845 8.800 8.845 6.8 496077 YES 158.80 303.520 Comedy 47 607 44000 0
505 20.9482 78.86 0.427 33496.650 154.3 8.64 8.880 8.680 8.790 6.8 518438 YES 205.60 203.040 Comedy 45 604 38000 0
In [7]:
# Reading one random row of the data

df.sample()
Out[7]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views 3D_available Time_taken Twitter_hastags Genre Avg_age_actors Num_multiplex Collection Tech_Oscar
2 20.5458 69.14 0.531 39912.675 134.6 7.485 7.57 7.495 7.515 7.44 547051 NO 147.88 2022.4 Comedy 38 458 69400 1
In [8]:
# Reading 5 random rows of the data

df.sample(5)
Out[8]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views 3D_available Time_taken Twitter_hastags Genre Avg_age_actors Num_multiplex Collection Tech_Oscar
27 39.1154 71.28 0.462 33591.085 162.3 7.750 7.800 7.70 7.845 6.80 406866 YES 173.92 262.368 Drama 20 488 29600 0
108 22.5604 72.12 0.480 35963.070 170.6 8.640 8.910 8.73 8.850 7.82 473768 NO 171.92 203.168 Action 48 502 39600 1
111 22.0168 75.02 0.453 37301.825 155.1 8.595 8.685 8.50 8.870 8.44 495560 YES 113.12 263.648 Thriller 34 544 45600 0
33 43.0344 71.28 0.462 31669.055 168.5 8.100 8.165 8.02 8.140 7.80 394967 YES 167.24 302.096 Comedy 25 571 26200 0
259 33.1330 62.94 0.353 38007.310 173.5 8.900 9.090 8.84 9.145 8.40 483741 YES 146.04 224.816 Thriller 40 599 60200 0
In [9]:
# Checking the shape of the data

df.shape
Out[9]:
(506, 19)
In [10]:
# Checking the number of rows of the data

df.shape[0]
Out[10]:
506
In [11]:
# Checking the number of columns of the data

df.shape[1]
Out[11]:
19
In [12]:
# Reading the names of the columns

df.columns
Out[12]:
Index(['Marketing expense', 'Production expense', 'Multiplex coverage',
       'Budget', 'Movie_length', 'Lead_ Actor_Rating', 'Lead_Actress_rating',
       'Director_rating', 'Producer_rating', 'Critic_rating', 'Trailer_views',
       '3D_available', 'Time_taken', 'Twitter_hastags', 'Genre',
       'Avg_age_actors', 'Num_multiplex', 'Collection', 'Tech_Oscar'],
      dtype='object')
In [13]:
# Renaming a column; note that rename returns a new DataFrame, so df itself is unchanged
df.rename(columns={'Genre':'Genre_imp'})
Out[13]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views 3D_available Time_taken Twitter_hastags Genre_imp Avg_age_actors Num_multiplex Collection Tech_Oscar
0 20.1264 59.62 0.462 36524.125 138.7 7.825 8.095 7.910 7.995 7.94 527367 YES 109.60 223.840 Thriller 23 494 48000 1
1 20.5462 69.14 0.531 35668.655 152.4 7.505 7.650 7.440 7.470 7.44 494055 NO 146.64 243.456 Drama 42 462 43200 0
2 20.5458 69.14 0.531 39912.675 134.6 7.485 7.570 7.495 7.515 7.44 547051 NO 147.88 2022.400 Comedy 38 458 69400 1
3 20.6474 59.36 0.542 38873.890 119.3 6.895 7.035 6.920 7.020 8.26 516279 YES 185.36 225.344 Drama 45 472 66800 1
4 21.3810 59.36 0.542 39701.585 127.7 6.920 7.070 6.815 7.070 8.26 531448 NO 176.48 225.792 Drama 55 395 72400 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 21.2526 78.86 0.427 36624.115 142.6 8.680 8.775 8.620 8.970 6.80 492480 NO 186.96 243.584 Action 27 561 44800 0
502 20.9054 78.86 0.427 33996.600 150.2 8.780 8.945 8.770 8.930 7.80 482875 YES 132.24 263.296 Action 20 600 41200 0
503 21.2152 78.86 0.427 38751.680 164.5 8.830 8.970 8.855 9.010 7.80 532239 NO 109.56 243.824 Comedy 31 576 47800 0
504 22.1918 78.86 0.427 37740.670 162.8 8.730 8.845 8.800 8.845 6.80 496077 YES 158.80 303.520 Comedy 47 607 44000 0
505 20.9482 78.86 0.427 33496.650 154.3 8.640 8.880 8.680 8.790 6.80 518438 YES 205.60 203.040 Comedy 45 604 38000 0

506 rows × 19 columns

In [14]:
df.dtypes
Out[14]:
Marketing expense      float64
Production expense     float64
Multiplex coverage     float64
Budget                 float64
Movie_length           float64
Lead_ Actor_Rating     float64
Lead_Actress_rating    float64
Director_rating        float64
Producer_rating        float64
Critic_rating          float64
Trailer_views            int64
3D_available            object
Time_taken             float64
Twitter_hastags        float64
Genre                   object
Avg_age_actors           int64
Num_multiplex            int64
Collection               int64
Tech_Oscar               int64
dtype: object
In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Marketing expense    506 non-null    float64
 1   Production expense   506 non-null    float64
 2   Multiplex coverage   506 non-null    float64
 3   Budget               506 non-null    float64
 4   Movie_length         506 non-null    float64
 5   Lead_ Actor_Rating   506 non-null    float64
 6   Lead_Actress_rating  506 non-null    float64
 7   Director_rating      506 non-null    float64
 8   Producer_rating      506 non-null    float64
 9   Critic_rating        506 non-null    float64
 10  Trailer_views        506 non-null    int64  
 11  3D_available         506 non-null    object 
 12  Time_taken           494 non-null    float64
 13  Twitter_hastags      506 non-null    float64
 14  Genre                506 non-null    object 
 15  Avg_age_actors       506 non-null    int64  
 16  Num_multiplex        506 non-null    int64  
 17  Collection           506 non-null    int64  
 18  Tech_Oscar           506 non-null    int64  
dtypes: float64(12), int64(5), object(2)
memory usage: 75.2+ KB
In [15]:
df.isnull()
Out[15]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views 3D_available Time_taken Twitter_hastags Genre Avg_age_actors Num_multiplex Collection Tech_Oscar
0 False False False False False False False False False False False False False False False False False False False
1 False False False False False False False False False False False False False False False False False False False
2 False False False False False False False False False False False False False False False False False False False
3 False False False False False False False False False False False False False False False False False False False
4 False False False False False False False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 False False False False False False False False False False False False False False False False False False False
502 False False False False False False False False False False False False False False False False False False False
503 False False False False False False False False False False False False False False False False False False False
504 False False False False False False False False False False False False False False False False False False False
505 False False False False False False False False False False False False False False False False False False False

506 rows × 19 columns

In [16]:
df.isnull().sum()
Out[16]:
Marketing expense       0
Production expense      0
Multiplex coverage      0
Budget                  0
Movie_length            0
Lead_ Actor_Rating      0
Lead_Actress_rating     0
Director_rating         0
Producer_rating         0
Critic_rating           0
Trailer_views           0
3D_available            0
Time_taken             12
Twitter_hastags         0
Genre                   0
Avg_age_actors          0
Num_multiplex           0
Collection              0
Tech_Oscar              0
dtype: int64
In [ ]:
Numerical Columns

Mean/Median/Mode ----> non-time-series data

Algorithm Imputations

Time series ----> ffill, bfill, interpolation
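A small sketch of these imputation options on a toy Series with missing values (the numbers are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

# Non-time-series: replace missing values with a summary statistic
mean_filled = s.fillna(s.mean())

# Time series: carry neighbouring observations or interpolate between them
ffilled      = s.ffill()        # forward fill: last observed value
bfilled      = s.bfill()        # backward fill: next observed value
interpolated = s.interpolate()  # linear interpolation between neighbours
```

For ordered (time-indexed) data, interpolation usually preserves the trend better than a global mean.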
In [16]:
df.Budget
Out[16]:
0      36524.125
1      35668.655
2      39912.675
3      38873.890
4      39701.585
         ...    
501    36624.115
502    33996.600
503    38751.680
504    37740.670
505    33496.650
Name: Budget, Length: 506, dtype: float64
In [17]:
df["Budget"]
Out[17]:
0      36524.125
1      35668.655
2      39912.675
3      38873.890
4      39701.585
         ...    
501    36624.115
502    33996.600
503    38751.680
504    37740.670
505    33496.650
Name: Budget, Length: 506, dtype: float64
In [18]:
df.Genre
Out[18]:
0      Thriller
1         Drama
2        Comedy
3         Drama
4         Drama
         ...   
501      Action
502      Action
503      Comedy
504      Comedy
505      Comedy
Name: Genre, Length: 506, dtype: object
In [19]:
plt.scatter(df['Marketing expense'],df['Production expense'])
Out[19]:
<matplotlib.collections.PathCollection at 0x39c16f8460>
In [15]:
# Creating the Data Dictionary, with the first column holding each variable's datatype

Data_dict = pd.DataFrame(df.dtypes)
Data_dict
Out[15]:
0
Marketing expense float64
Production expense float64
Multiplex coverage float64
Budget float64
Movie_length float64
Lead_ Actor_Rating float64
Lead_Actress_rating float64
Director_rating float64
Producer_rating float64
Critic_rating float64
Trailer_views int64
3D_available object
Time_taken float64
Twitter_hastags float64
Genre object
Avg_age_actors int64
Num_multiplex int64
Collection int64
Tech_Oscar int64
In [17]:
# identifying the missing values from the dataset.

Data_dict['MissingVal'] = df.isnull().sum()
Data_dict
Out[17]:
0 MissingVal
Marketing expense float64 0
Production expense float64 0
Multiplex coverage float64 0
Budget float64 0
Movie_length float64 0
Lead_ Actor_Rating float64 0
Lead_Actress_rating float64 0
Director_rating float64 0
Producer_rating float64 0
Critic_rating float64 0
Trailer_views int64 0
3D_available object 0
Time_taken float64 12
Twitter_hastags float64 0
Genre object 0
Avg_age_actors int64 0
Num_multiplex int64 0
Collection int64 0
Tech_Oscar int64 0
In [17]:
# Identifying unique values. nunique() returns the number of unique elements in each column.

Data_dict['UniqueVal'] = df.nunique()
Data_dict
Out[17]:
0 MissingVal UniqueVal
Marketing expense float64 0 504
Production expense float64 0 76
Multiplex coverage float64 0 81
Budget float64 0 446
Movie_length float64 0 356
Lead_ Actor_Rating float64 0 339
Lead_Actress_rating float64 0 354
Director_rating float64 0 339
Producer_rating float64 0 353
Critic_rating float64 0 74
Trailer_views int64 0 504
3D_available object 0 2
Time_taken float64 12 449
Twitter_hastags float64 0 423
Genre object 0 4
Avg_age_actors int64 0 42
Num_multiplex int64 0 293
Collection int64 0 228
Tech_Oscar int64 0 2
In [18]:
# identifying the non-null count of each variable.

Data_dict['Count'] = df.count()
Data_dict
Out[18]:
0 MissingVal UniqueVal Count
Marketing expense float64 0 504 506
Production expense float64 0 76 506
Multiplex coverage float64 0 81 506
Budget float64 0 446 506
Movie_length float64 0 356 506
Lead_ Actor_Rating float64 0 339 506
Lead_Actress_rating float64 0 354 506
Director_rating float64 0 339 506
Producer_rating float64 0 353 506
Critic_rating float64 0 74 506
Trailer_views int64 0 504 506
3D_available object 0 2 506
Time_taken float64 12 449 494
Twitter_hastags float64 0 423 506
Genre object 0 4 506
Avg_age_actors int64 0 42 506
Num_multiplex int64 0 293 506
Collection int64 0 228 506
Tech_Oscar int64 0 2 506

Descriptive Statistics¶

In [17]:
# View the descriptive statistics of the dataset

df.describe()
Out[17]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views Time_taken Twitter_hastags Avg_age_actors Num_multiplex Collection Tech_Oscar
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 494.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 92.270471 77.273557 0.445305 34911.144022 142.074901 8.014002 8.185613 8.019664 8.190514 7.810870 449860.715415 157.391498 260.832095 39.181818 545.043478 45057.707510 0.545455
std 172.030902 13.720706 0.115878 3903.038232 28.148861 1.054266 1.054290 1.059899 1.049601 0.659699 68917.763145 31.295161 104.779133 12.513697 106.332889 18364.351764 0.498422
min 20.126400 55.920000 0.129000 19781.355000 76.400000 3.840000 4.035000 3.840000 4.030000 6.600000 212912.000000 0.000000 201.152000 3.000000 333.000000 10000.000000 0.000000
25% 21.640900 65.380000 0.376000 32693.952500 118.525000 7.316250 7.503750 7.296250 7.507500 7.200000 409128.000000 132.300000 223.796000 28.000000 465.000000 34050.000000 0.000000
50% 25.130200 74.380000 0.462000 34488.217500 151.000000 8.307500 8.495000 8.312500 8.465000 7.960000 462460.000000 160.000000 254.400000 39.000000 535.500000 42400.000000 1.000000
75% 93.541650 91.200000 0.551000 36793.542500 167.575000 8.865000 9.030000 8.883750 9.030000 8.260000 500247.500000 181.890000 283.416000 50.000000 614.750000 50000.000000 1.000000
max 1799.524000 110.480000 0.615000 48772.900000 173.500000 9.435000 9.540000 9.425000 9.635000 9.400000 567784.000000 217.520000 2022.400000 60.000000 868.000000 100000.000000 1.000000
In [18]:
plt.hist(df['Marketing expense'])
Out[18]:
(array([439.,  44.,  14.,   1.,   3.,   2.,   0.,   1.,   1.,   1.]),
 array([  20.1264 ,  198.06616,  376.00592,  553.94568,  731.88544,
         909.8252 , 1087.76496, 1265.70472, 1443.64448, 1621.58424,
        1799.524  ]),
 <BarContainer object of 10 artists>)
In [19]:
# get descriptive statistics for numeric ("number") datatypes

df.describe(include = ['number'])
Out[19]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views Time_taken Twitter_hastags Avg_age_actors Num_multiplex Collection Tech_Oscar
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 494.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 92.270471 77.273557 0.445305 34911.144022 142.074901 8.014002 8.185613 8.019664 8.190514 7.810870 449860.715415 157.391498 260.832095 39.181818 545.043478 45057.707510 0.545455
std 172.030902 13.720706 0.115878 3903.038232 28.148861 1.054266 1.054290 1.059899 1.049601 0.659699 68917.763145 31.295161 104.779133 12.513697 106.332889 18364.351764 0.498422
min 20.126400 55.920000 0.129000 19781.355000 76.400000 3.840000 4.035000 3.840000 4.030000 6.600000 212912.000000 0.000000 201.152000 3.000000 333.000000 10000.000000 0.000000
25% 21.640900 65.380000 0.376000 32693.952500 118.525000 7.316250 7.503750 7.296250 7.507500 7.200000 409128.000000 132.300000 223.796000 28.000000 465.000000 34050.000000 0.000000
50% 25.130200 74.380000 0.462000 34488.217500 151.000000 8.307500 8.495000 8.312500 8.465000 7.960000 462460.000000 160.000000 254.400000 39.000000 535.500000 42400.000000 1.000000
75% 93.541650 91.200000 0.551000 36793.542500 167.575000 8.865000 9.030000 8.883750 9.030000 8.260000 500247.500000 181.890000 283.416000 50.000000 614.750000 50000.000000 1.000000
max 1799.524000 110.480000 0.615000 48772.900000 173.500000 9.435000 9.540000 9.425000 9.635000 9.400000 567784.000000 217.520000 2022.400000 60.000000 868.000000 100000.000000 1.000000
In [21]:
# get descriptive statistics for categorical ("object") datatypes

df.describe(include = ['object'])
Out[21]:
3D_available Genre
count 506 506
unique 2 4
top YES Thriller
freq 279 183
In [21]:
# column-wise mean of the numeric columns
# (newer pandas versions require df.mean(numeric_only=True) when object columns are present)
df.mean()
Out[21]:
Marketing expense          92.270471
Production expense         77.273557
Multiplex coverage          0.445305
Budget                  34911.144022
Movie_length              142.074901
Lead_ Actor_Rating          8.014002
Lead_Actress_rating         8.185613
Director_rating             8.019664
Producer_rating             8.190514
Critic_rating               7.810870
Trailer_views          449860.715415
Time_taken                157.391498
Twitter_hastags           260.832095
Avg_age_actors             39.181818
Num_multiplex             545.043478
Collection              45057.707510
Tech_Oscar                  0.545455
dtype: float64
In [22]:
# column-wise median of the numeric columns
# (newer pandas versions require df.median(numeric_only=True) when object columns are present)
df.median()
Out[22]:
Marketing expense          25.1302
Production expense         74.3800
Multiplex coverage          0.4620
Budget                  34488.2175
Movie_length              151.0000
Lead_ Actor_Rating          8.3075
Lead_Actress_rating         8.4950
Director_rating             8.3125
Producer_rating             8.4650
Critic_rating               7.9600
Trailer_views          462460.0000
Time_taken                160.0000
Twitter_hastags           254.4000
Avg_age_actors             39.0000
Num_multiplex             535.5000
Collection              42400.0000
Tech_Oscar                  1.0000
dtype: float64
In [23]:
plt.hist(df['Time_taken'])
Out[23]:
(array([  2.,   0.,   0.,   0.,   7., 108., 101., 106., 110.,  60.]),
 array([  0.   ,  21.752,  43.504,  65.256,  87.008, 108.76 , 130.512,
        152.264, 174.016, 195.768, 217.52 ]),
 <BarContainer object of 10 artists>)

Missing Value Imputation¶

In [20]:
# calculating the mean of the Time_taken 

df['Time_taken'].mean()
Out[20]:
157.39149797570855
In [21]:
# replace the missing values in Time_taken with the column mean

df['Time_taken'] = df['Time_taken'].fillna(df['Time_taken'].mean())  # assignment form; column-level inplace=True is deprecated in newer pandas
In [22]:
# View info of Columns of the dataset such as number of entries, name of columns and data type


df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Marketing expense    506 non-null    float64
 1   Production expense   506 non-null    float64
 2   Multiplex coverage   506 non-null    float64
 3   Budget               506 non-null    float64
 4   Movie_length         506 non-null    float64
 5   Lead_ Actor_Rating   506 non-null    float64
 6   Lead_Actress_rating  506 non-null    float64
 7   Director_rating      506 non-null    float64
 8   Producer_rating      506 non-null    float64
 9   Critic_rating        506 non-null    float64
 10  Trailer_views        506 non-null    int64  
 11  3D_available         506 non-null    object 
 12  Time_taken           506 non-null    float64
 13  Twitter_hastags      506 non-null    float64
 14  Genre                506 non-null    object 
 15  Avg_age_actors       506 non-null    int64  
 16  Num_multiplex        506 non-null    int64  
 17  Collection           506 non-null    int64  
 18  Tech_Oscar           506 non-null    int64  
dtypes: float64(12), int64(5), object(2)
memory usage: 75.2+ KB
In [23]:
# checking the null values column wise

df.isnull().sum()
Out[23]:
Marketing expense      0
Production expense     0
Multiplex coverage     0
Budget                 0
Movie_length           0
Lead_ Actor_Rating     0
Lead_Actress_rating    0
Director_rating        0
Producer_rating        0
Critic_rating          0
Trailer_views          0
3D_available           0
Time_taken             0
Twitter_hastags        0
Genre                  0
Avg_age_actors         0
Num_multiplex          0
Collection             0
Tech_Oscar             0
dtype: int64
In [13]:
#Checking the null values of complete dataset

df.isnull().sum().sum()
Out[13]:
0

Dummy Variable Creation¶

In [30]:
# checking the first five rows of the dataset

df.head()
Out[30]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views 3D_available Time_taken Twitter_hastags Genre Avg_age_actors Num_multiplex Collection Tech_Oscar
0 20.1264 59.62 0.462 36524.125 138.7 7.825 8.095 7.910 7.995 7.94 527367 YES 109.60 223.840 Thriller 23 494 48000 1
1 20.5462 69.14 0.531 35668.655 152.4 7.505 7.650 7.440 7.470 7.44 494055 NO 146.64 243.456 Drama 42 462 43200 0
2 20.5458 69.14 0.531 39912.675 134.6 7.485 7.570 7.495 7.515 7.44 547051 NO 147.88 2022.400 Comedy 38 458 69400 1
3 20.6474 59.36 0.542 38873.890 119.3 6.895 7.035 6.920 7.020 8.26 516279 YES 185.36 225.344 Drama 45 472 66800 1
4 21.3810 59.36 0.542 39701.585 127.7 6.920 7.070 6.815 7.070 8.26 531448 NO 176.48 225.792 Drama 55 395 72400 1
In [24]:
# Converting non-numerical columns into numerical using the get_dummies method

df = pd.get_dummies(df,columns = ["3D_available","Genre"],drop_first = True)
In [25]:
df.head()
Out[25]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating ... Time_taken Twitter_hastags Avg_age_actors Num_multiplex Collection Tech_Oscar 3D_available_YES Genre_Comedy Genre_Drama Genre_Thriller
0 20.1264 59.62 0.462 36524.125 138.7 7.825 8.095 7.910 7.995 7.94 ... 109.60 223.840 23 494 48000 1 1 0 0 1
1 20.5462 69.14 0.531 35668.655 152.4 7.505 7.650 7.440 7.470 7.44 ... 146.64 243.456 42 462 43200 0 0 0 1 0
2 20.5458 69.14 0.531 39912.675 134.6 7.485 7.570 7.495 7.515 7.44 ... 147.88 2022.400 38 458 69400 1 0 1 0 0
3 20.6474 59.36 0.542 38873.890 119.3 6.895 7.035 6.920 7.020 8.26 ... 185.36 225.344 45 472 66800 1 1 0 1 0
4 21.3810 59.36 0.542 39701.585 127.7 6.920 7.070 6.815 7.070 8.26 ... 176.48 225.792 55 395 72400 1 0 0 1 0

5 rows × 21 columns

In [26]:
# checking the number of rows and columns after converting the complete dataset to numerical

df.shape
Out[26]:
(506, 21)
In [1]:
21*21  # 21 columns, so the correlation matrix below has 21 x 21 = 441 entries
Out[1]:
441

Data Visualization¶

In [17]:
# Visualizing the Pairplot of the complete dataset

sns.pairplot(df)
Out[17]:
<seaborn.axisgrid.PairGrid at 0x66f3eccb70>
In [34]:
# calculating the correlation of complete dataset

corr = df.corr()
corr
Out[34]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating ... Time_taken Twitter_hastags Avg_age_actors Num_multiplex Collection Tech_Oscar 3D_available_YES Genre_Comedy Genre_Drama Genre_Thriller
Marketing expense 1.000000 0.406583 -0.420972 -0.219247 0.352734 0.380050 0.379813 0.380069 0.376462 -0.184985 ... 0.025694 0.013518 0.059204 0.383298 -0.389582 -0.013417 -0.086805 0.066796 -0.016894 -0.037123
Production expense 0.406583 1.000000 -0.763651 -0.391676 0.644779 0.706481 0.707956 0.707566 0.705819 -0.251565 ... 0.015773 -0.000839 0.055810 0.707559 -0.484754 -0.024404 -0.115401 0.086958 -0.026590 -0.098976
Multiplex coverage -0.420972 -0.763651 1.000000 0.302188 -0.731470 -0.768589 -0.769724 -0.769157 -0.764873 0.145555 ... 0.035515 0.004882 -0.092104 -0.915495 0.429300 -0.004017 0.073903 -0.068554 0.046393 0.037772
Budget -0.219247 -0.391676 0.302188 1.000000 -0.240265 -0.208464 -0.203981 -0.201907 -0.205397 0.232361 ... 0.040439 0.030674 -0.064694 -0.282796 0.696304 -0.027148 0.163774 -0.052579 -0.004195 0.046251
Movie_length 0.352734 0.644779 -0.731470 -0.240265 1.000000 0.746904 0.746493 0.747021 0.746707 -0.217830 ... -0.019820 0.009380 0.075198 0.673896 -0.377999 0.016291 0.005101 0.092693 0.003452 -0.088609
Lead_ Actor_Rating 0.380050 0.706481 -0.768589 -0.208464 0.746904 1.000000 0.997905 0.997735 0.994073 -0.169978 ... 0.038050 0.014463 0.036794 0.706331 -0.251355 -0.035309 -0.025208 0.044592 -0.035171 -0.030763
Lead_Actress_rating 0.379813 0.707956 -0.769724 -0.203981 0.746493 0.997905 1.000000 0.998097 0.994003 -0.165992 ... 0.037975 0.010239 0.038005 0.708257 -0.249459 -0.040356 -0.020056 0.046974 -0.038965 -0.030566
Director_rating 0.380069 0.707566 -0.769157 -0.201907 0.747021 0.997735 0.998097 1.000000 0.994126 -0.166638 ... 0.035881 0.010077 0.041470 0.709364 -0.246650 -0.035768 -0.020195 0.046268 -0.033510 -0.033634
Producer_rating 0.376462 0.705819 -0.764873 -0.205397 0.746707 0.994073 0.994003 0.994126 1.000000 -0.167003 ... 0.028695 0.005850 0.032542 0.703518 -0.248200 -0.043612 -0.020022 0.051274 -0.031696 -0.033829
Critic_rating -0.184985 -0.251565 0.145555 0.232361 -0.217830 -0.169978 -0.165992 -0.166638 -0.167003 1.000000 ... -0.014762 -0.023655 -0.049797 -0.128769 0.341288 -0.001084 0.039235 -0.015253 0.057177 -0.037129
Trailer_views -0.443457 -0.591657 0.581386 0.602536 -0.589318 -0.490267 -0.487536 -0.486452 -0.487911 0.228641 ... 0.074517 -0.006704 -0.049726 -0.544100 0.720119 -0.075783 0.090664 -0.106439 -0.000179 0.109849
Time_taken 0.025694 0.015773 0.035515 0.040439 -0.019820 0.038050 0.037975 0.035881 0.028695 -0.014762 ... 1.000000 -0.006382 0.072049 -0.056704 0.110005 -0.063753 -0.024431 0.012908 0.049285 -0.098138
Twitter_hastags 0.013518 -0.000839 0.004882 0.030674 0.009380 0.014463 0.010239 0.010077 0.005850 -0.023655 ... -0.006382 1.000000 -0.004840 0.006255 0.023122 0.077333 -0.066012 0.034407 0.036442 -0.058431
Avg_age_actors 0.059204 0.055810 -0.092104 -0.064694 0.075198 0.036794 0.038005 0.041470 0.032542 -0.049797 ... 0.072049 -0.004840 1.000000 0.078811 -0.047426 0.040581 -0.013581 -0.030584 -0.015918 -0.036611
Num_multiplex 0.383298 0.707559 -0.915495 -0.282796 0.673896 0.706331 0.708257 0.709364 0.703518 -0.128769 ... -0.056704 0.006255 0.078811 1.000000 -0.391729 -0.004857 -0.052262 0.070720 -0.035126 -0.048863
Collection -0.389582 -0.484754 0.429300 0.696304 -0.377999 -0.251355 -0.249459 -0.246650 -0.248200 0.341288 ... 0.110005 0.023122 -0.047426 -0.391729 1.000000 0.154698 0.182867 -0.077478 0.036233 0.071751
Tech_Oscar -0.013417 -0.024404 -0.004017 -0.027148 0.016291 -0.035309 -0.040356 -0.035768 -0.043612 -0.001084 ... -0.063753 0.077333 0.040581 -0.004857 0.154698 1.000000 0.070371 0.021134 0.061414 -0.072842
3D_available_YES -0.086805 -0.115401 0.073903 0.163774 0.005101 -0.025208 -0.020056 -0.020195 -0.020022 0.039235 ... -0.024431 -0.066012 -0.013581 -0.052262 0.182867 0.070371 1.000000 0.004617 0.035491 0.017341
Genre_Comedy 0.066796 0.086958 -0.068554 -0.052579 0.092693 0.044592 0.046974 0.046268 0.051274 -0.015253 ... 0.012908 0.034407 -0.030584 0.070720 -0.077478 0.021134 0.004617 1.000000 -0.323621 -0.500192
Genre_Drama -0.016894 -0.026590 0.046393 -0.004195 0.003452 -0.035171 -0.038965 -0.033510 -0.031696 0.057177 ... 0.049285 0.036442 -0.015918 -0.035126 0.036233 0.061414 0.035491 -0.323621 1.000000 -0.366563
Genre_Thriller -0.037123 -0.098976 0.037772 0.046251 -0.088609 -0.030763 -0.030566 -0.033634 -0.033829 -0.037129 ... -0.098138 -0.058431 -0.036611 -0.048863 0.071751 -0.072842 0.017341 -0.500192 -0.366563 1.000000

21 rows × 21 columns

In [35]:
# Visualizing the heatmap of the correlation matrix

sns.heatmap(corr)
Out[35]:
<AxesSubplot:>

X-y split (Input-Output Split)¶

In [27]:
# Selecting the input columns (everything except the target Tech_Oscar)

X = df.loc[:,df.columns!="Tech_Oscar"]
type(X)
Out[27]:
pandas.core.frame.DataFrame
In [28]:
# Checking the first five rows of the input columns

X.head()
Out[28]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views Time_taken Twitter_hastags Avg_age_actors Num_multiplex Collection 3D_available_YES Genre_Comedy Genre_Drama Genre_Thriller
0 20.1264 59.62 0.462 36524.125 138.7 7.825 8.095 7.910 7.995 7.94 527367 109.60 223.840 23 494 48000 1 0 0 1
1 20.5462 69.14 0.531 35668.655 152.4 7.505 7.650 7.440 7.470 7.44 494055 146.64 243.456 42 462 43200 0 0 1 0
2 20.5458 69.14 0.531 39912.675 134.6 7.485 7.570 7.495 7.515 7.44 547051 147.88 2022.400 38 458 69400 0 1 0 0
3 20.6474 59.36 0.542 38873.890 119.3 6.895 7.035 6.920 7.020 8.26 516279 185.36 225.344 45 472 66800 1 0 1 0
4 21.3810 59.36 0.542 39701.585 127.7 6.920 7.070 6.815 7.070 8.26 531448 176.48 225.792 55 395 72400 0 0 1 0
In [29]:
# Checking the shape of the input dataset

X.shape
Out[29]:
(506, 20)
In [30]:
# Selecting the output column

y = df["Tech_Oscar"]
type(y)
Out[30]:
pandas.core.series.Series
In [31]:
# Checking the first five rows of the output

y.head()
Out[31]:
0    1
1    0
2    1
3    1
4    1
Name: Tech_Oscar, dtype: int64
In [30]:
# Checking the number of rows of the output

y.shape
Out[30]:
(506,)
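As an aside, the boolean-mask selection used above (`df.columns != "Tech_Oscar"`) is equivalent to `DataFrame.drop`; a minimal sketch on a toy frame (column names here are illustrative, not the full movie dataset):

```python
import pandas as pd

# Toy frame with a target column, mimicking the structure of df above
df_toy = pd.DataFrame({
    "Budget": [100, 200, 300],
    "Collection": [150, 180, 400],
    "Tech_Oscar": [1, 0, 1],
})

# Two equivalent ways to separate the inputs from the output column
X_mask = df_toy.loc[:, df_toy.columns != "Tech_Oscar"]
X_drop = df_toy.drop(columns="Tech_Oscar")
y_toy = df_toy["Tech_Oscar"]

print(X_mask.equals(X_drop))  # True: both give the same input frame
```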

Test-Train Split¶

In [40]:
# Importing the train-test split function

from sklearn.model_selection import train_test_split
In [42]:
# Splitting the data into training (80%) and testing (20%) sets

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2,random_state=0)
In [43]:
# Checking first five rows of input training dataset

X_train.head()
Out[43]:
Marketing expense Production expense Multiplex coverage Budget Movie_length Lead_ Actor_Rating Lead_Actress_rating Director_rating Producer_rating Critic_rating Trailer_views Time_taken Twitter_hastags Avg_age_actors Num_multiplex Collection 3D_available_YES Genre_Comedy Genre_Drama Genre_Thriller
220 27.1618 67.40 0.493 38612.805 162.0 8.485 8.640 8.485 8.670 8.52 480270 174.68 224.272 23 536 53400 0 0 0 1
71 23.1752 76.62 0.587 33113.355 91.0 7.280 7.400 7.290 7.455 8.16 491978 200.68 263.472 46 400 43400 0 0 0 0
240 22.2658 64.86 0.572 38312.835 127.8 6.755 6.935 6.800 6.840 8.68 470107 204.80 224.320 24 387 54000 1 1 0 0
6 21.7658 70.74 0.476 33396.660 140.1 7.065 7.265 7.150 7.400 8.96 459241 139.16 243.664 41 522 45800 1 0 0 1
417 538.8120 91.20 0.321 29463.720 162.6 9.135 9.305 9.095 9.165 6.96 302776 172.16 301.664 60 589 20800 1 0 0 0
In [44]:
# Checking the number of rows and columns of input training dataset

X_train.shape
Out[44]:
(404, 20)
In [45]:
# Checking the number of rows and columns of input testing dataset

X_test.shape
Out[45]:
(102, 20)
In [ ]:
Data terminology

Rows = Observations / Records / Samples / Tuples

Columns = Attributes / Variables / Features
In [ ]:
Algorithms/Computers work with numbers, and feature ranges can differ widely:

A: 1000 to 100000

B: 0 to 10

C: 0 and 1

Features on such different scales should be rescaled before distance-based algorithms like SVM.
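The note above is the usual motivation for feature scaling: features on very different ranges (thousands, single digits, binary) contribute unevenly to distance-based algorithms such as SVM. A sketch with synthetic columns in those three ranges, standardized as in the next section:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three synthetic features on very different scales, as in the note above
A = rng.uniform(1000, 100000, size=(100, 1))         # large range
B = rng.uniform(0, 10, size=(100, 1))                # small range
C = rng.integers(0, 2, size=(100, 1)).astype(float)  # binary
X_raw = np.hstack([A, B, C])

X_scaled = StandardScaler().fit_transform(X_raw)

# After standardization every column has mean ~0 and std ~1
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```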

Rescaling the Data using Standardization¶

In [46]:
# Importing the StandardScaler class for standardization

from sklearn.preprocessing import StandardScaler
In [47]:
# Fitting the StandardScaler on the training data

sc = StandardScaler().fit(X_train)
In [48]:
# Transforming the training data to standardized values (mean 0, std 1)

X_train_std = sc.transform(X_train)
In [49]:
X_train_std
Out[49]:
array([[-0.37257438, -0.70492455,  0.42487874, ..., -0.66547513,
        -0.48525664,  1.3293319 ],
       [-0.39709866, -0.04487755,  1.24185891, ..., -0.66547513,
        -0.48525664, -0.75225758],
       [-0.402693  , -0.88675963,  1.11148974, ...,  1.50268577,
        -0.48525664, -0.75225758],
       ...,
       [-0.39805586, -0.15941933,  0.0772276 , ..., -0.66547513,
        -0.48525664, -0.75225758],
       [-0.38842357, -0.60326872,  0.93766417, ..., -0.66547513,
        -0.48525664,  1.3293319 ],
       [-0.39951258, -1.01275558,  0.3988049 , ..., -0.66547513,
        -0.48525664,  1.3293319 ]])
In [ ]:
Clean Data ---> Training and Testing

Clean Data (Training Data) + Algorithm = Model

Testing Data ---> Model = Prediction
In [ ]:
Clean Data ---> Algorithm

(EDA ---> Clean Data) + Algorithm (ML/DL) = Model
In [50]:
# Transforming the test data with the scaler fitted on the training data

X_test_std = sc.transform(X_test)
In [51]:
X_test_std
Out[51]:
array([[-0.40835869, -1.12872913,  0.83336883, ...,  1.50268577,
        -0.48525664, -0.75225758],
       [ 0.71925111,  0.9988844 , -0.65283979, ...,  1.50268577,
        -0.48525664, -0.75225758],
       [-0.40257488,  0.39610829,  0.05115377, ...,  1.50268577,
        -0.48525664, -0.75225758],
       ...,
       [-0.3982601 , -0.85812418,  0.89420778, ..., -0.66547513,
        -0.48525664,  1.3293319 ],
       [-0.39934279, -0.07637654,  0.58132175, ...,  1.50268577,
        -0.48525664, -0.75225758],
       [-0.40088071, -0.36702631,  0.31189212, ..., -0.66547513,
        -0.48525664, -0.75225758]])
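Note that the scaler was fitted only on the training data, so the test set is transformed with the training mean and standard deviation; this avoids leaking test-set statistics into the model. A quick sketch of what `transform` does here:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

sc = StandardScaler().fit(train)  # learns mean and std from train only
test_std = sc.transform(test)

# Manual equivalent: population std (ddof=0), matching StandardScaler
manual = (test - train.mean()) / train.std()
print(np.allclose(test_std, manual))  # True
```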

Training SVM¶

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [52]:
# Importing the SVM Classifier

from sklearn import svm
In [53]:
# Training a linear-kernel SVM classifier on the standardized training data

clf_svm_l = svm.SVC(kernel='linear', C=100)
clf_svm_l.fit(X_train_std, y_train)
Out[53]:
SVC(C=100, kernel='linear')
In [ ]:
Recall = TP / (TP + FN)  (best value: 1)

Precision = TP / (TP + FP)  (best value: 1)


TP = True Positives

TN = True Negatives

FP = False Positives

FN = False Negatives
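These definitions can be checked in code; a sketch on hand-made labels (values chosen only for illustration):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# sklearn's binary confusion matrix unravels as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)     # share of actual positives that were caught
precision = tp / (tp + fp)  # share of predicted positives that were correct

print(recall == recall_score(y_true, y_pred),
      precision == precision_score(y_true, y_pred))  # True True
```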

Predict values using trained model¶

In [54]:
# Predicting labels for the test data (training prediction commented out)

#y_train_pred = clf_svm_l.predict(X_train_std)
y_test_pred = clf_svm_l.predict(X_test_std)
In [55]:
# Checking the predicted values


y_test_pred
Out[55]:
array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0], dtype=int64)

Model Performance¶

In [56]:
# Importing the accuracy_score and confusion_matrix packages

from sklearn.metrics import accuracy_score, confusion_matrix
In [57]:
# Checking the confusion matrix

confusion_matrix(y_test, y_test_pred)
Out[57]:
array([[25, 19],
       [25, 33]], dtype=int64)
In [58]:
# Checking the accuracy score

accuracy_score(y_test, y_test_pred)
Out[58]:
0.5686274509803921
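The accuracy above can be read straight off the confusion matrix: correct predictions (the diagonal) divided by all predictions. A sketch using the matrix from the cell above:

```python
import numpy as np

# Confusion matrix from the cell above: rows are true class (0, 1),
# columns are predicted class (0, 1)
cm = np.array([[25, 19],
               [25, 33]])

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()  # (33 + 25) / 102
print(accuracy)  # ~0.5686, matching accuracy_score above
```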
In [49]:
# Checking the number of support vectors per class

clf_svm_l.n_support_
Out[49]:
array([144, 146])

Grid Search¶

In [50]:
# Importing the GridSearchCV package for hyperparameter optimization

from sklearn.model_selection import GridSearchCV 
In [51]:
# Setting the candidate hyperparameter values for C

params = {'C':(0.001,0.005,0.01,0.05, 0.1, 0.5, 1, 5, 10, 50,100,500,1000)} 
In [52]:
# Creating an SVC classifier object with a linear kernel

clf_svm_l = svm.SVC(kernel='linear')
In [53]:
# Setting up GridSearchCV with 10-fold cross-validation

svm_grid_lin = GridSearchCV(clf_svm_l, params, n_jobs=-1,
                            cv=10, verbose=1, scoring='accuracy') 
In [54]:
# Fitting GridSearchCV on the training data

svm_grid_lin.fit(X_train_std, y_train)
Fitting 10 folds for each of 13 candidates, totalling 130 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 130 out of 130 | elapsed:  2.2min finished
Out[54]:
GridSearchCV(cv=10, estimator=SVC(kernel='linear'), n_jobs=-1,
             param_grid={'C': (0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50,
                               100, 500, 1000)},
             scoring='accuracy', verbose=1)
In [55]:
# Checking the best parameter of SVM 
svm_grid_lin.best_params_ 
Out[55]:
{'C': 0.1}
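Conceptually, GridSearchCV is just a loop: for each candidate C, score the model with k-fold cross-validation on the training data and keep the value with the best mean accuracy. A minimal sketch on a synthetic dataset (`make_classification` here stands in for the movie data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, random_state=0)

candidates = (0.01, 0.1, 1, 10)
scores = {C: cross_val_score(SVC(kernel='linear', C=C), X_demo, y_demo,
                             cv=5, scoring='accuracy').mean()
          for C in candidates}

best_C = max(scores, key=scores.get)  # same selection GridSearchCV makes
print(best_C, round(scores[best_C], 3))
```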
In [56]:
# Retrieving the classifier refit with the best parameter value

linsvm_clf = svm_grid_lin.best_estimator_
In [57]:
# Checking the accuracy score with the best parameter value

accuracy_score(y_test, linsvm_clf.predict(X_test_std))
Out[57]:
0.5980392156862745

Polynomial¶

In [58]:
# Training an SVC classifier with a degree-2 polynomial kernel

clf_svm_p3 = svm.SVC(kernel='poly', degree=2, C=0.1)
clf_svm_p3.fit(X_train_std, y_train)
Out[58]:
SVC(C=0.1, degree=2, kernel='poly')
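For reference, sklearn's polynomial kernel computes K(x, z) = (gamma * &lt;x, z&gt; + coef0) ** degree. A sketch checking the formula against sklearn's implementation (the gamma/coef0 values here are arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[3.0, 0.5]])
gamma, coef0, degree = 0.5, 1.0, 2

# K(x, z) = (gamma * <x, z> + coef0) ** degree
manual = (gamma * x @ z.T + coef0) ** degree

lib = polynomial_kernel(x, z, degree=degree, gamma=gamma, coef0=coef0)
print(np.allclose(manual, lib))  # True
```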
In [59]:
# Predicting the training and testing data values

y_train_pred = clf_svm_p3.predict(X_train_std)
y_test_pred = clf_svm_p3.predict(X_test_std)
In [60]:
# Checking the accuracy score

accuracy_score(y_test, y_test_pred)
Out[60]:
0.5588235294117647
In [61]:
# Checking the number of support vectors per class

clf_svm_p3.n_support_
Out[61]:
array([185, 194])

Radial¶

In [62]:
# Training an SVC classifier with an RBF kernel

clf_svm_r = svm.SVC(kernel='rbf', gamma=0.5, C=10)
clf_svm_r.fit(X_train_std, y_train)
Out[62]:
SVC(C=10, gamma=0.5)
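The rbf kernel measures similarity as K(x, z) = exp(-gamma * ||x - z||^2), so gamma controls how quickly similarity decays with distance (larger gamma gives more local decision boundaries). A sketch checking the formula against sklearn:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
z = np.array([[2.0, 4.0]])
gamma = 0.5

# K(x, z) = exp(-gamma * ||x - z||^2)
manual = np.exp(-gamma * np.sum((x - z) ** 2))

lib = rbf_kernel(x, z, gamma=gamma)[0, 0]
print(np.isclose(manual, lib))  # True
```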
In [63]:
# Predicting the training and testing data values

y_train_pred = clf_svm_r.predict(X_train_std)
y_test_pred = clf_svm_r.predict(X_test_std)
In [64]:
# Checking the accuracy score

accuracy_score(y_test, y_test_pred)
Out[64]:
0.6176470588235294
In [65]:
# Checking the Parameter values

clf_svm_r.n_support_
Out[65]:
array([186, 218])

Radial Grid¶

In [66]:
# Defining the hyperparameter grid of C and gamma values

params = {'C':(0.01,0.05, 0.1, 0.5, 1, 5, 10, 50), 
          'gamma':(0.001, 0.01, 0.1, 0.5, 1)} 
In [67]:
# Creating an SVC classifier with an RBF kernel

clf_svm_r = svm.SVC(kernel='rbf')
In [68]:
# Creating the GridSearchCV object with 3-fold cross-validation

svm_grid_rad = GridSearchCV(clf_svm_r, params, n_jobs=-1,
                            cv=3, verbose=1, scoring='accuracy') 
In [69]:
# Fitting GridSearchCV on the training data

svm_grid_rad.fit(X_train_std, y_train)
Fitting 3 folds for each of 40 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:    0.7s finished
Out[69]:
GridSearchCV(cv=3, estimator=SVC(), n_jobs=-1,
             param_grid={'C': (0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50),
                         'gamma': (0.001, 0.01, 0.1, 0.5, 1)},
             scoring='accuracy', verbose=1)
In [70]:
# Checking the best hyperparameter values

svm_grid_rad.best_params_ 
Out[70]:
{'C': 50, 'gamma': 0.001}
In [71]:
# Checking the best estimator

radsvm_clf = svm_grid_rad.best_estimator_
In [72]:
# Checking the accuracy score with the best hyperparameter values

accuracy_score(y_test, radsvm_clf.predict(X_test_std))
Out[72]:
0.6176470588235294